1 Introduction

This practice is part of the subject Biomedical Data Science of the Degree in Data Science of the Universitat Politècnica de València, and taught by the Department of Applied Physics.

The measurement of data quality dimensions (DQ) is the central axis for the evaluation and improvement of data quality as well as for its correct and optimal use. Data quality dimensions are individual aspects or constructs that represent data quality attributes. To these can be associated one or more metrics, quantified with specific methods, as well as exploratory methods.

This practice is intended to provide an initial basis for the evaluation of DQ metrics. It will consist of the application of a series of methods for different dimensions of DQ. In the context of the maternal and child clinical setting, we will analyze a data file whose ultimate purpose is the monitoring of care indicators in this setting. Depending on the dimension, we will apply the methods and calculate the metrics both in general for the whole dataset and monitored over batches of time (by months), simulating the results of a DQ monitoring and continuous improvement system.

In some parts of the code we will find the text ##TODO## that we will need to complete. Additionally, we will have to discuss the results in those points where it is indicated. The deliverable of the practice will consist of the compilation in html of this R markdown file, using Knit, where the results of the execution and figures will be observed, and having completed the ##TODO## and commented the results.

2 Preparation of the work environment

We check that the working directory is in the one where we have the practice file and the folder with the data:

getwd()

Otherwise, we set it (change the example directory to ours):

setwd("D:/PRACTICAS/P3")

We install the required libraries and then load them into the working environment.

install.packages("zoo", repos = "http://cran.us.r-project.org")
install.packages("rts", repos = "http://cran.us.r-project.org")
install.packages("plotly", repos = "http://cran.us.r-project.org")
install.packages("devtools", repos = "http://cran.us.r-project.org")
library("devtools")
devtools::install_github('c5sire/datacheck')
devtools::install_github("hms-dbmi/EHRtemporalVariability")
library("zoo")
library("rts")
library("plotly")
library("datacheck")
library("EHRtemporalVariability")

3 Data loading

We set the initial parameters of the data. The main date of the records, which will be used for the purpose of monitoring the delivery care indicators, is the date of birth.

# File name
fileName = "data/DQIinfantFeeding.csv"
# Whether it has a header or not
hasHeader = TRUE
# Main date column to be used for monitoring purposes
dateColumn = "BIRTH_DATE"
# Format of the previous date
dateFormat = '%d/%m/%Y'
# Which text string will represent missing data
missingValue = ""

We load the file data/DQIinfantFeeding.csv in a data.frame named repository:

repository <- read.csv2(fileName, header=hasHeader, na.strings=missingValue)

# We collect the number of rows and columns

N <- nrow(repository)
D <- ncol(repository)

For monitoring purposes, we will use the zoo library (S3 Infrastructure for Regular and Irregular Time Series - Z’s Ordered Observations) to convert the data, the data.frame, to a format suited for batch analyses, the zoo format.

zooRepository <- read.zoo(repository,format = dateFormat,index.column = dateColumn)

4 Problem in the monitoring of indicators

One of the main uses of the maternal and infant data repository studied is the monitoring of quality of care indicators. In the field of newborn feeding, one of the most important indicators is whether there has been early initiation of breastfeeding in the delivery room. To calculate this indicator, we create the following function that will obtain the indicator for each batch of data received, so that we can apply it repeatedly for each batch given a frequency.

# how much woman started to breastfeed during the 1st hour in delivery room between all woman who didn't declare they don't want ot breastfeed.

indicatorEBF_delroom <- function(dataset){
  
  numerator = (dataset$NEO_MOMFBF_TYPE %in% 'During the 1st hour') &
    (dataset$NEO_PFBF_TYPE %in% 'Delivery room') &
    (dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))

  denominator = (dataset$NEO_MOMFBF_TYPE %in% c('During the 1st hour', 'During the 2nd hour', 'During the 3rd hour','Breastfeeding does not start')) &
    (dataset$NEO_PFBF_TYPE %in% c('Delivery room', 'Hospitalization unit', 'Breastfeeding does not start')) &
    !(dataset$NEO_FBFEVAL_TY %in% 'Undesired breastfeeding') &
    (dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))

  indicator = sum(numerator)/sum(denominator) * 100
  
  if (sum(denominator) == 0){
    print("denominator is 0")
  }
  
  return(indicator)
}

Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts (Raster Time Series Analysis) library. In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:

resIndicatorES2SC_delroom =apply.monthly(zooRepository, FUN=indicatorEBF_delroom)
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
## [1] "denominator is 0"
plot(resIndicatorES2SC_delroom,xlab = "Date", ylab ="%",main = "Early breastfeeding start in the delivery room", ylim=c(0,100))

#resIndicatorES2SC_delroom

📝 DISCUSS RESULTS

The indicator describes how much woman who deliverd in certain ways started to breastfeed during the 1st hour in delivery room between all such woman who didn’t declare they don’t want ot breastfeed.

We can see that overall indicator is quite high, fluctuating between around 77 and 100%. Between IIQ 2009 and IIQ 2011 we don’t have woman matching our denominator, disturbing us from calculating the indicator values during this period. It is possible that this was a trend of declaring not breastfeeding, many woman delivered by caesarea, we have some missing values in our dataset or other case occured. We’ll examine some of these possibilities further.


5 Completeness

5.1 General

We will find the missing data in the repository and calculate the corresponding metrics. First, for each variable:

NAmatrix <- !is.na(repository)
sumNAmatrix <- apply(NAmatrix,2,sum)
completenessByColumn <- round(sumNAmatrix/N*100,2)
knitr::kable(completenessByColumn)
x
PATIENT_ID 100.00
DEL_TYPE_A 41.62
DEL_MODE 96.50
DEL_TYPE_ANE 96.45
DEL_IN_INTRACA 3.66
BIRTH_DATE 100.00
NEO_ES2SC_TYPE 98.08
NEO_ES2SCP_TYPE 97.79
NEO_MOMFBF_TYPE 98.13
NEO_PFBF_TYPE 97.87
NEO_FBFEVAL_TY 97.79
NEO_APGAR_NUM 31.31
NEO_WEIGHT_NUM 95.42
NEO_PHCORD_TXT 70.64
NEO_BF_DIS 95.90

Next, we will calculate and display the overall percentage of missing data:

completenessByDataset = round(sum(NAmatrix)/(N*D)*100,2) #round(sum(completenessByColumn)/D, 2)
completenessByDataset
## [1] 81.41
knitr::kable(table(repository$DEL_TYPE_ANE), format = "markdown")
Var1 Freq
0 6
1 1
epidural 351
Epidural anesthesia 964
Epidural anesthesia / General anesthesia 4
general 13
General anesthesia 17
local 290
Local anesthesia 542
no 298
No anesthesia 101
sedation 3
spinal 161
Spinal 285
Spinal anesthesia 90
Spinal anesthesia / General anesthesia 16
Without anesthesia 524

📝 DISCUSS RESULTS

We have 81.41% of complete data. The most incomplete column is DEL_IN_INTRACA, which has only 3.66% values, which means we probably can’t use it for models. We also have many missing values – over 50% in Apgar score (NEO_APGAR_NUM) and TYPE_A. The number of missing values for these columns may also exclude these feature from prediction further. NEO_PHCORD_TXT has over 70% complete values, so we can try to take care of it. Other features are quite complete, however, only patient id and birthdate were filled for all cases.


5.2 Monitoring

To monitor the completeness by temporary batches we will create a function that does this calculation for each batch it receives as a parameter, and returns the completeness for each column, the function dimCompletessByColumn:

dimCompletenessByColumn <- function(repository){
  N = dim(repository)[1]
  NAmatrix <- !is.na(repository)
  sumNAmatrix <- apply(NAmatrix,2,sum)
  completenessByColumn <- round(sumNAmatrix/N*100,2)
  return(completenessByColumn)
}

Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts (Raster Time Series Analysis) library. In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:

resCompletenessByColumn = apply.monthly(zooRepository, FUN=dimCompletenessByColumn)

Now, we can create a plot with the results using the plotly library (Create Interactive Web Graphics via ‘plotly.js’). First for each variable:

p <-
  plot_ly(
    x = index(resCompletenessByColumn),
    y = resCompletenessByColumn[, 1],
    name = names(resCompletenessByColumn)[1],
    type = 'scatter',
    mode = 'lines'
  ) %>%
  plotly::layout(
    title = 'Completeness by month',
    xaxis = list(title = "Date"),
    yaxis = list (title = "Completeness (%)")
  )

 for (i in 2:ncol(resCompletenessByColumn)) {
  p = p %>% plotly::add_trace(y = resCompletenessByColumn[, i],
    name = names(resCompletenessByColumn)[i],
    mode = 'lines')
}

p

And secondly globally, being able to calculate the result from the variable resCompletenessByColumn and use the code to plot a single series from the previous code chunk:

globalCompleteness <- rowMeans(resCompletenessByColumn, na.rm = TRUE)

p <-
  plot_ly(
    x = index(resCompletenessByColumn),
    y = globalCompleteness,
    name = names(globalCompleteness),
    type = 'scatter',
    mode = 'lines'
  ) %>%
  plotly::layout(
       title = 'Completeness by month',
    xaxis = list(title = "Date"),
    yaxis = list (title = "Completeness (%)")
  )

p

📝 DISCUSS RESULTS

The incompleteness is fluctuant and we can’t see any pattern in misses. However, we can see that some values weren’t traced from the beginning. E.g. neo_phcord_txt was monitored from April 2010 and del_in_intraca even later – in March 2011.

Overall we can see that the completness raised in the IIQ of 2010, when neo_phcord started to be monitored. Then it remained between 78% and 86% until october 2014. We can see a huge drop in the end of 2011, confirmed by low score in neo_apgar_num, del_type_a and neo_phcord_txt completness. Also, there is also even bigger drop in the end – November 2014, when it reached only 68%. We can see that neo_apgar_num, del_type_a are mostly incomplete in this month and del_type_ane and neo_weight num have significantly dropped.


6 Consistency

We are going to analyze two multivariate consistency rules in the data. For this we will use the datacheck library (Tools for Checking Data Consistency), which allows us to write logical rules in R language, using the variable names of our data. These rules will be written in the attached file rules.R, the first one is provided as an example, and the second one should be coded based on the provided natural language expression rule.

# We read the rules file
rules = read_rules("rules.R")
# We evaluate the rules on our data
profile = datadict_profile(repository, rules)
# We show the account of errors for each rule
knitr::kable(profile$checks[,c(1,3,6)])
Variable Rule Error.sum
DEL_MODE !(sapply(DEL_MODE, tolower) %in% “elective caesarea”) | (sapply(DEL_TYPE_ANE, tolower) %in% c(“epidural anesthesia”, “epidural anesthesia / general anesthesia”, “spinal anesthesia”, “spinal anesthesia / general anesthesia”, “general anesthesia”)) 220
DEL_MODE !(sapply(DEL_MODE, tolower) %in% c(“emergency caesarea”, “intrapartum caesarea”)) | (sapply(DEL_TYPE_ANE, tolower) %in% c(“epidural anesthesia / general anesthesia”, “spinal anesthesia”, “spinal anesthesia / general anesthesia”, “general anesthesia”)) 526
# We list the cases that have been marked as inconsistent for each rule
knitr::kable(profile$checks[,c(1,7)])
Variable Error.list
DEL_MODE 472,500,525,583,636,638,654,683,698,699,729,793,802,811,821,838,846,865,866,890,918,969,975,985,987,1008,1010,1013,1066,1076,1082,1129,1150,1190,1194,1241,1246,1270,1300,1367,1383,1384,1388,1402,1411,1428,1433,1445,1446,1453,1459,1501,1508,1531,1559,1575,1578,1612,1616,1635,1665,1667,1669,1674,1685,1692,1719,1753,1770,1774,1781,1784,1787,1790,1791,1800,1802,1836,1839,1840,1855,1857,1861,1874,1883,1902,1925,1928,1933,1939,1950,1963,1970,1972,1984,1987,2000,2001,2010,2049,2055,2058,2066,2102,2122,2134,2135,2145,2169,2251,2278,2292,2322,2334,2348,2368,2369,2386,2401,2409,2457,2474,2493,2518,2530,2550,2551,2563,2583,2591,2592,2623,2624,2632,2647,2676,2679,2680,2686,2706,2716,2732,2762,2767,2778,2822,2857,2865,2883,2901,2907,2920,2923,2934,2935,2943,2963,2965,2981,2985,3002,3006,3016,3018,3024,3029,3039,3073,3099,3106,3107,3119,3126,3158,3179,3183,3188,3201,3216,3221,3222,3240,3254,3256,3259,3323,3344,3356,3359,3363,3365,3380,3391,3410,3418,3427,3443,3484,3519,3531,3550,3553,3560,3575,3592,3603,3618,3622,3627,3630,3638,3648,3660,3670,3676,3677,3698,3704,3707,3761
DEL_MODE 58,60,66,157,165,167,182,224,251,256,270,301,322,323,347,350,375,382,392,395,428,463,473,484,492,495,503,508,512,521,524,526,534,538,539,543,552,556,572,576,585,592,597,601,604,607,616,617,618,633,635,640,641,643,652,658,660,663,669,677,680,684,688,689,696,702,707,710,711,712,714,717,733,740,742,748,759,761,763,767,768,770,776,779,782,791,795,797,804,814,828,836,839,842,844,847,857,874,878,882,888,892,901,903,910,911,914,917,919,925,926,930,931,935,941,946,961,971,977,998,1007,1017,1022,1023,1024,1025,1032,1033,1035,1048,1050,1051,1052,1053,1056,1060,1062,1067,1072,1078,1084,1086,1088,1095,1103,1110,1111,1124,1126,1131,1132,1133,1135,1137,1139,1140,1143,1144,1146,1148,1163,1171,1179,1184,1187,1188,1191,1197,1199,1210,1214,1216,1220,1224,1234,1237,1244,1250,1255,1257,1258,1268,1273,1278,1282,1283,1286,1291,1294,1295,1297,1299,1308,1321,1329,1337,1353,1355,1357,1360,1361,1369,1376,1378,1380,1382,1385,1395,1400,1412,1413,1414,1417,1421,1426,1434,1438,1440,1442,1450,1456,1461,1464,1465,1472,1473,1476,1477,1479,1481,1488,1496,1504,1512,1519,1522,1537,1565,1568,1581,1583,1600,1604,1608,1610,1624,1626,1628,1632,1637,1638,1640,1658,1661,1681,1697,1698,1708,1710,1723,1736,1742,1746,1755,1762,1769,1795,1804,1807,1817,1820,1821,1824,1826,1831,1845,1860,1866,1868,1870,1899,1914,1916,1923,1924,1931,1936,1944,1946,1952,1960,1976,1981,1988,1994,1995,2004,2005,2011,2012,2022,2027,2034,2062,2076,2079,2082,2090,2095,2106,2123,2132,2151,2152,2157,2164,2165,2180,2182,2183,2200,2204,2213,2215,2216,2228,2229,2232,2235,2243,2264,2266,2275,2279,2293,2297,2300,2341,2370,2383,2392,2412,2419,2421,2423,2435,2441,2462,2464,2489,2494,2505,2515,2526,2533,2538,2539,2541,2542,2554,2559,2566,2581,2585,2588,2601,2602,2607,2617,2627,2665,2671,2672,2675,2689,2692,2699,2709,2711,2718,2725,2741,2753,2784,2785,2794,2800,2803,2807,2817,2823,2825,2833,2849,2855,2856,2878,2893,2894,2898,2899,2909,2933,2945,2946,2948,2953,2964,2977,2988,2994,3000,3001,3003,3011,3020,3022,3023,3027,3036,3041,3052,3074,3083,3085,3095,3096,3098,3101,3103,3111,3114,3116,3127,3128,3129,3131,3132,3140,3142,3146,3149,3150,3160,3163,3172,3177,3192,3193,3194,3204,3212,3213,3214,3223,3228,3230,3235,3236,3243,3255,3265,3267,3274,3275,3276,3283,3286,3301,3334,3340,3342,3347,3348,3358,3361,3383,3389,3390,3398,3400,3431,3438,3459,3470,3488,3496,3500,3508,3514,3517,3525,3529,3533,3536,3557,3562,3564,3571,3572,3590,3593,3600,3616,3617,3628,3631,3643,3646,3647,3653,3659,3666,3671,3672,3673,3674,3679,3684,3688,3689,3693,3702,3703,3756,3797
repository[c(58, 60, 350, 270,3703),c("DEL_MODE", "DEL_TYPE_ANE")]
##                  DEL_MODE        DEL_TYPE_ANE
## 58     Emergency Caesarea Epidural anesthesia
## 60     Emergency Caesarea Epidural anesthesia
## 350    emergency caesarea       No anesthesia
## 270    emergency caesarea       No anesthesia
## 3703 Intrapartum Caesarea Epidural anesthesia
746/3801
## [1] 0.1962641

📝 DISCUSS RESULT

For these two rules we found 746 inconsistencies among 3801 examples. It’s almost 20% of the data. Each of these cases should be checked and fixed or deleted. It’d require a lot of work and we’d loose a lot of data. Of course each rule has exceptions, but if we’d like to use the data for predictions, we’d rather like to predict average case and not outliers.


7 Temporal variability

We are going to analyze if there are pattern changes in the data distributions over time. To do this we will use the EHRtemporalVariability library (Delineating Temporal Dataset Shifts in Electronic Health Records). First, we change to basic type Date the case date, originally in text format:

repositoryFormatted <- EHRtemporalVariability::formatDate(
              input         = repository,
              dateColumn    = "BIRTH_DATE",
              dateFormat = dateFormat
             )

We obtain the temporal maps from the already formatted repository using the function estimateDataTemporalMap and selecting a monthly period. We can get the help of the function by typing ?estimateDataTemporalMap in the console (it is only necessary to pass the data, the column with the date, and the desired period, the rest is obtained automatically from the data).

probMaps <- estimateDataTemporalMap(
  repositoryFormatted, "BIRTH_DATE", "month")

Next we obtain the information geometry plots from the previously estimated temporal maps. To do this we will use the function estimateIGTProjection on the list of temporal maps. We are going to save the results in a variable.

igtProjs <- sapply( probMaps, estimateIGTProjection )
names( igtProjs ) <- names( probMaps )

We can observe as an example the data temporal map and information geometry temporal (IGT) plot of the variable type of anesthesia during delivery DEL_TYPE_ANE:

plotDataTemporalMap(probMaps$DEL_TYPE_ANE)
plotIGTProjection(igtProjs$DEL_TYPE_ANE, trajectory = TRUE)

In this example, we can see that over time there are changes or differences associated with synonymy of terms (No anesthesia, no, Without anesthesia), or even differences between upper and lower case (spinal, Spinal).

Next, we are going to save the results of the temporal variability estimation in a file, so that we can upload them to the web application EHRtemporalVariability, in the Load your RData tab.

save(probMaps, igtProjs, file = "variabilityData.RData")

Either using the web application, or obtaining the graphs in RStudio, we will study the temporal evolution of the type of delivery variable DEL_MODE, and discuss the results and their potential implications in the problem defined at the beginning of the practice on the monitoring of the early onset of lactation indicator, as well as propose a possible solution.


📝 DISCUSS RESULTS

To see the results we’ve used the web application. We’ve also looked into these articles to learn how to interpret IGT plots: article1, article2

Generally, looking at the IGT plot we can distinguish three temporal subgroups – untill march 2009 (including) and then before and after march 2011. March 2011 represents the transition between two groups. The changes were rather among the years, we can’t see sesonality in the data. We have one outlier – November 2014. It resembles data from the beginning of 2009. Maybe it begins the new transition, because we see that December 2014 is also moved a bit closer to the 2009 subgroup. However, we don’t have data to determine it.

The look at the heatmap clarifies the differences. In the beginning of 2009 we had a lot of vaginal deliveries, many emergency cesareas and some vaccum deliveries and elective cesareas. The trend remained the same till March 2011, but the labels switched to lowercase. After March 2011 the labels switched back to first letter uppercase. Moreover, emergency cesareas were replaced by intrapartum cesareas and the trend for elective cesareas was stronger. November 2014 is an outlier, because there were a lot of empty string values at the time. In December 2014, we didn’t see many cesareas, what may cause the lean towards 2009s.


8 Acknowledgments

The maternal and infant data studied have been kindly provided by Ricardo García de León for educational purposes only. Their use or distribution outside this subject and practice is forbidden.